Abstract

Predictive sparse coding algorithms recently have demonstrated impressive performance on a variety 
of supervised tasks, but they lack a learning theoretic analysis. We establish the first generalization 
bounds for predictive sparse coding. In the overcomplete dictionary learning setting, where the dictionary 
size k exceeds the dimensionality d of the data, we present an estimation error bound that is roughly 
$O(\sqrt{dk/m} + \sqrt{s}/(\mu m))$. In the infinite-dimensional setting, we show a dimension-free bound that is
roughly $O(k\sqrt{s}/(\mu\sqrt{m}))$. The quantity $\mu$ is a measure of the incoherence of the dictionary and $s$ is
the sparsity level. Both bounds are data-dependent, explicitly taking into account certain incoherence 
properties of the learned dictionary and the sparsity level of the codes learned on actual data. 
Keywords: Statistical learning theory, luckiness, data-dependent complexity, dictionary learning, sparse 
coding, LASSO 

1 Introduction 

Learning architectures such as the support vector machine and other linear predictors enjoy strong
theoretical properties and many empirical successes (Steinwart and Christmann, 2008; Kakade et al., 2009;
Schölkopf and Smola, 2002), but a learning-theoretic study of many more complex learning architectures
is lacking. Predictive methods based on sparse coding have recently emerged which simultaneously learn a
data representation via a nonlinear encoding scheme and an estimator linear in that representation (Bradley
and Bagnell, 2009; Mairal et al., 2010, 2009). A sparse coding representation $z \in \mathbb{R}^k$ of a data point
$x \in \mathbb{R}^d$ is learned by representing $x$ as a sparse linear combination of $k$ atoms $D_j \in \mathbb{R}^d$ of a dictionary
$D = (D_1, \ldots, D_k) \in \mathbb{R}^{d \times k}$. In the coding $x \approx \sum_{j=1}^k z_j D_j$, all but a few $z_j$ are zero.

Predictive sparse coding methods such as Mairal et al. (2010)'s task-driven dictionary learning recently 
have achieved state-of-the-art results on many tasks, including the MNIST digits task. Whereas standard 
sparse coding minimizes an unsupervised, reconstructive $\ell_2$ loss, predictive sparse coding seeks to minimize
a supervised loss by optimizing a dictionary $D$ and a predictor which takes encodings to $D$ as input. It
has been observed empirically that sparse coding can provide good abstraction by finding higher-level
representations which are useful in predictive tasks (Yu et al., 2009). Intuitively, the power of prediction-driven
dictionaries is that they pack more atoms in parts of the representational space where the prediction task is 
more difficult. Despite the empirical successes of predictive sparse coding methods, it is unknown how well 
they generalize in a theoretical sense. 

In this work, we develop what to our knowledge are the first generalization error bounds for predictive 
sparse coding algorithms; in particular, we focus on $\ell_1$-regularized sparse coding. Maurer and Pontil (2008)
and Vainsencher et al. (2011) previously established generalization bounds for the classical, reconstructive
sparse coding setting. There, the objective is to learn a representation of data by coding the data to a 
dictionary of atoms such that the data can be reconstructed with low error using only the coded data and 
the dictionary. Extending their analysis to the predictive setting introduces certain difficulties, the most 
salient being a seemingly unavoidable demand that the learned encoder be stable with respect to dictionary 
perturbations. Our analysis therefore is intimately tied to encoder stability properties. 
The sparse encoder's stability is characterized by properties central to sparse inference: 

• The sparsity level s - the number of non-zeros in an encoding of a point. 

• The s-incoherence $\mu_s$ - the square of the minimum singular value among all s-column subdictionaries
of D. 

• The coding margin - a measure of a coordinate's sign stability. 

Since the sparsity level and the coding margin depend on the particular training sample, our learning bounds 
will depend on the maximum sparsity level and the minimum coding margin observed on the training sample. 
The need for encoder stability may be the price one pays for absconding from the (quite stable) prison of
$\ell_2$-regularization. As in the luckiness frameworks of Shawe-Taylor et al. (1998) and Herbrich and Williamson
(2002), our learning bounds depend on a posteriori-observable properties related to the training sample and
the learned hypothesis; hence we will need to guarantee that these properties hold with high probability over 
a second, ghost sample. 

We provide learning bounds for two core scenarios in sparse coding: the overcomplete setting of $k > d$
and the infinite-dimensional setting where $d \gg k$ or $d$ is even infinite. Both bounds hold provided that $m$
is larger than 3 times the inverse of the permissible radius of perturbation, a function linear in the coding
margin, the sparsity level, the inverse of the s-incoherence, and the $\ell_1$-regularization parameter $\lambda$.

Our contributions are: 

1. A forward stability result for the LASSO which holds under mild conditions. This result implies 
conditions for support preservation under dictionary perturbations. 

2. In the overcomplete setting, a learning bound on the estimation error for predictive sparse coding that 
is essentially of order $\sqrt{dk/m} + \sqrt{s}/(\mu m)$ (Corollary 1).

3. In the infinite-dimensional setting, a learning bound on the estimation error for predictive sparse coding 
that is independent of the dimension of the data; the bound is essentially of order $k\sqrt{s}/(\mu\sqrt{m})$ (Corollary 2).

Whereas in the overcomplete setting the bound depends on the observed sparsity level and the coding margin 
on the training sample, in the infinite-dimensional setting the bound depends on these quantities as measured 
on both the training sample and a second, unlabeled sample not used for training. Both learning bounds 
contain a factor of k, and the optimal order of k is unknown — in both the predictive setting and the 
reconstructive setting. In addition to providing guarantees in the case of simultaneous dictionary and linear 
estimator learning (true predictive sparse coding), the presented bounds are quite general in that they also 
apply to the heuristic, two-stage algorithm where one is given a labeled training sample, learns a dictionary 
reconstructively (not using the labels) to fix the sparse codes, and then learns a linear estimator using the 
same training sample but with the labels as well. 

The next section introduces the reconstructive and predictive forms of sparse coding. Section 3 sets 
up notation and presents two results: the Sparse Coding Stability Theorem (Theorem 1) and a useful 
symmetrization lemma. The overcomplete and infinite-dimensional settings are covered in Sections 4 and 5 
respectively. In Section 6 we compare the results to recent learning bounds for unsupervised sparse coding. 
To allow the paper to flow smoothly, we provide proof sketches in the main paper and leave detailed proofs 
to the appendix. 
2 Sparse coding, reconstructive and predictive



Suppose points $x_i$ have been drawn from a probability measure $P_x$ over $B_{\mathbb{R}^d}$, the unit ball in $\mathbb{R}^d$.
The sparse coding problem is to represent each point $x_i$ as a sparse linear combination of $k$ basis vectors, or
atoms $D_1, \ldots, D_k$. The atoms form the columns of a dictionary $D \in \mathcal{D}$, where $\mathcal{D}$ is a space of dictionaries
$\mathcal{D} := (B_{\mathbb{R}^d})^k$ and $D_i = (D_i^1, \ldots, D_i^d)^T$. We will use this definition of $\mathcal{D}$ from here on out. In this work, we
use $rB_{\mathbb{R}^d}$ to denote the origin-centered ball in $\mathbb{R}^d$ scaled to radius $r$.
It will be useful to frame sparse coding in terms of an encoder $\varphi_D$:

$\varphi_D(x) := \arg\min_z \|x - Dz\|_2^2 + \xi(z)$, (1)

where $\xi(\cdot)$ is a sparsity-inducing regularizer, or to be colorfully concise, a sparsifier.



Reconstructive sparse coding. The reconstructive sparse coding objective is

$\min_{D \in \mathcal{D}} E_{x \sim P_x} \|x - D\varphi_D(x)\|_2^2 + \xi(\varphi_D(x))$,

where $P_x$ is a probability measure on $B_{\mathbb{R}^d}$. Generalization bounds for the empirical risk minimization (ERM)
variant of this objective have been established: Maurer and Pontil (2008) showed an $O(k/\sqrt{m})$ bound that
is independent of the dimension $d$; this is useful when $d \gg k$, as in general Hilbert spaces. Vainsencher
et al. (2011) handled the overcomplete setting, producing a bound that is $O(\sqrt{dk}/\sqrt{m})$ as well as fast rates
of $O(dk/m)$.

Predictive sparse coding. As in Mairal et al. (2010), the predictive setup minimizes a supervised loss
with respect to a representation and an estimator linear in the representation. Define a space of linear
hypotheses $\mathcal{W} := rB_{\mathbb{R}^k}$, for $r > 0$. Given a probability measure $P$ on $B_{\mathbb{R}^d} \times \mathcal{Y}$, with $\mathcal{Y} \subset \mathbb{R}$ in regression and
$\mathcal{Y} = \{-1, 1\}$ in binary classification, the predictive sparse coding objective is

$\min_{D \in \mathcal{D}, w \in \mathcal{W}} E_{(x,y) \sim P}\, \phi(y, \langle w, \varphi_D(x) \rangle) + \Theta(w)$, (2)

where $\Theta(\cdot)$ is a regularizer often taken to be proportional to the squared $\ell_2$-norm.

Other formulations exist, some of which wield a separate dictionary for each class (Mairal et al., 2008); we 
do not consider such formulations here for multiple reasons. First, they do not naturally extend to regression 
problems. Also, when the number of classes is large, both the computational and sample complexities of
learning a dictionary for each class become prohibitive.



Encoder stability. The choice of the sparsifier $\xi$ seems to be pivotal both from an empirical perspective
and a theoretical one. Bradley and Bagnell (2009) used a differentiable approximate sparsifier based on the
Kullback-Leibler divergence. The sparsifier is approximate because true sparsity does not result, although
it encourages low $\ell_1$ norms; nevertheless, true sparsity is a necessity in certain applications and also aids
in some theoretical arguments. The most popular sparsifier is $\xi(\cdot) = \lambda\|\cdot\|_1$. Notably, $\|\cdot\|_1$ is the tightest
convex lower bound for the $\ell_0$ "norm": $|\{i : x_i \neq 0\}|$.

From a stability perspective the $\ell_1$ regularizer regrettably is not well-behaved in general. Indeed, from
the lack of strict convexity it is evident that each $x$ need not have a unique image under $\varphi_D$. It also is
unclear how to analyze the class of mappings $\varphi_D$, parameterized by $D$, if the map changes drastically under
small perturbations to $D$. Hence, we will begin by establishing sufficient conditions under which $\varphi_D$ is stable
under perturbations to $D$.
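
As a concrete illustration of the $\ell_1$ encoder, the following Python sketch approximates $\varphi_D(x)$ with iterative soft-thresholding (ISTA). It is not the solver used in this work; the function names, the iteration count, and the 1/2 factor on the squared loss are assumptions of the sketch. The 1/2 factor is chosen so that the optimality condition reads $|\langle D_j, x - Dz \rangle| = \lambda$ on the support; for the unscaled objective in (1), pass $\lambda/2$ instead.

```python
import numpy as np


def soft_threshold(v, thresh):
    """Entrywise soft-thresholding: shrink each entry toward zero by `thresh`."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)


def lasso_encode(D, x, lam, n_iter=2000):
    """Approximate argmin_z 0.5 * ||x - D z||_2^2 + lam * ||z||_1 by ISTA.

    Assumption: the 0.5 factor on the squared loss (see the text above).
    D is a d x k dictionary (atoms as columns); x is a point in R^d.
    """
    # Step size 1 / (Lipschitz constant of the gradient of the smooth part).
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)
        z = soft_threshold(z - step * grad, step * lam)
    return z
```

With an orthonormal dictionary the iteration reduces to a single soft-thresholding of $D^T x$, which gives a convenient sanity check.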

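In this work, we analyze the ERM variant of (2) with $\Theta(w)$ proportional to $\|w\|_2^2$:

$\min_{D \in \mathcal{D}, w \in \mathcal{W}} \frac{1}{m} \sum_{i=1}^m \phi(y_i, \langle w, \varphi_D(x_i) \rangle) + \Theta(w)$. (3)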

Since this objective is not convex, we will present bounds on the estimation error that hold uniformly over 
certain random subclasses of hypotheses. 
3 Definitions, notation, and foundational results



Some definitions and notation will be useful. Let $z$ be an iid random sample of $m$ points, where each
$z_i = (x_i, y_i)$, $x_i \in B_{\mathbb{R}^d}$ and $y_i \in \mathcal{Y}$. Let the space of targets, $\mathcal{Y}$, be a bounded subset of $\mathbb{R}$ (regression) or
$\{-1, 1\}$ (binary classification). Let $x''$ be a second, unlabeled iid sample of $m$ points $x''_1, \ldots, x''_m$; this sample
will be used only in the infinite-dimensional setting. Also, let $z'$ be an iid ghost sample of $m$ points with the
same distribution as $z$.

A predictive sparse coding hypothesis function $f$ is fully specified by a choice $(D, w) \in \mathcal{D} \times \mathcal{W}$, yielding
$f(x) = \langle w, \varphi_D(x) \rangle$. We often identify $f$ using the notation $f = (D, w)$. The function class $\mathcal{F}$ is the set of
such hypotheses. When provided a training sample $z$, the hypothesis returned by the learner will be referred
to as $\hat{f}$. Note that $\hat{f}$ is random, but $\hat{f}$ becomes a fixed function upon conditioning on $z$.

Throughout this work, $\phi : \mathcal{Y} \times \mathbb{R} \to [0, b]$ will be a bounded loss function ($b > 0$) that is $L$-Lipschitz in its
second argument. For $f \in \mathcal{F}$, define the loss-composed function $f_\phi : \mathcal{Y} \times B_{\mathbb{R}^d} \to [0, b]$ as $f_\phi(y, x) := \phi(y, f(x))$.
Let $\phi \circ \mathcal{F}$ be the class of such functions induced by the choice of $\mathcal{F}$ and $\phi$. We use the notation

$Pf = E f(x)$, $Pf_\phi = E \phi(y, f(x))$

for the expectation operator $P$ and the notation

$P_z f = \frac{1}{m} \sum_{i=1}^m f(x_i)$, $P_z f_\phi = \frac{1}{m} \sum_{i=1}^m \phi(y_i, f(x_i))$

for the empirical measure $P_z$ associated with sample $z$.
Define $\mathrm{LASSO}(\lambda, D, x)$ as the optimization problem

$\mathrm{LASSO}(\lambda, D, x) \equiv \min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$;

call the argmin $\varphi_D(x) = ((\varphi_D(x))_1, \ldots, (\varphi_D(x))_k)^T$ ($\lambda$ is fixed and hence withheld from the notation). Let
$\mathbb{R}_+$ be the set of positive reals and $[n] := \{1, \ldots, n\}$, for $n \in \mathbb{N}$. For $s \in [k]$, any $d \ge s$, and $D \in \mathcal{D}$, define
$\mu_s(D)$ as the minimum squared singular value, taken across all matrices formed by choosing $s$ distinct
columns of $D$. We often use the overloaded definition $\mu_s(f) := \mu_s(D)$ where $f = (D, w)$. The index set $\Lambda$ is
the set of non-zero indices of $\varphi_D(x)$; with some abuse of notation $\Lambda$ always is defined in terms of the most
recently used dictionary $D$ and data point $x$. Index sets induce coordinate projections in the following way:
if $z \in \mathbb{R}^k$ then $z_\Lambda = (z_j)_{j \in \Lambda}$. For $t \in \mathbb{R}^k$, define $\mathrm{supp}(t) := \{i \in [k] : t_i \neq 0\}$.

Finally, an epsilon-cover refers to the concept of an $\varepsilon$-cover but not a specific cover. All epsilon-covers
of spaces of dictionaries use the metric induced by the operator norm || • ||2. 
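
For small dictionaries the s-incoherence can be evaluated directly from its definition. The sketch below is a brute-force Python illustration (the function name `s_incoherence` is hypothetical); it enumerates all s-column subdictionaries, so its cost grows combinatorially in $k$ and it is meant only for sanity checks.

```python
from itertools import combinations

import numpy as np


def s_incoherence(D, s):
    """Brute-force mu_s(D): the minimum, over all choices of s distinct
    columns of D, of the squared smallest singular value of that submatrix.

    D is a d x k dictionary with d >= s. Cost grows as (k choose s).
    """
    k = D.shape[1]
    smallest = min(
        # Singular values are returned in descending order; [-1] is the smallest.
        np.linalg.svd(D[:, list(cols)], compute_uv=False)[-1]
        for cols in combinations(range(k), s)
    )
    return smallest ** 2
```

For two unit-norm atoms with inner product $c$, the Gram matrix has eigenvalues $1 \pm c$, so the 2-incoherence is $1 - |c|$; nearly parallel atoms therefore drive $\mu_2$ toward zero.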

3.1 Sparse coding stability 

We begin with a fundamental forward stability result for the LASSO; we are not aware of any similar result 
in the literature. 

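Theorem 1 (Sparse Coding Stability). Let $\lambda > 0$, $x \in B_{\mathbb{R}^d}$, and $D, \tilde{D} \in (B_{\mathbb{R}^d})^k$ with $\mu_s(D), \mu_s(\tilde{D}) \ge \mu$
for $\mu > 0$. If there are $\tau \ge 0$ and $s \in [k]$ such that:

(i) $\|\varphi_D(x)\|_0 \le s$, (4)

(ii) $|(\varphi_D(x))_j| \ge \tau$ for all $j \in \Lambda$, (5)

(iii) $|\langle D_j, x - D\varphi_D(x) \rangle| \le \lambda - \tau$ for all $j \notin \Lambda$; (6)

$\|D - \tilde{D}\|_2 \le \varepsilon \le [\ldots]$, (7)

then $\mathrm{supp}(\varphi_D(x)) = \mathrm{supp}(\varphi_{\tilde{D}}(x))$ and $\|\varphi_D(x) - \varphi_{\tilde{D}}(x)\|_2 \le [\ldots]$.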

Condition (6) means that atoms not used in the coding $\varphi_D(x)$ cannot have too high absolute correlation with
the residual $x - D\varphi_D(x)$. Note that the right-most term of (7) is the permissible radius of perturbation (PRP).
In short, the theorem says that if $\mathrm{LASSO}(\lambda, D, x)$ admits a stable sparse solution, then a small perturbation
to the dictionary will not change the support of the solution, and the perturbation to the solution will be
bounded by a constant factor times the size of the perturbation (where the constant depends on a condition
number to a restricted problem, the amount of $\ell_1$-regularization, and the sparsity level).
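
A small numerical experiment can make the statement tangible. The Python sketch below is an illustration under stated assumptions, not part of the analysis: it encodes one point with a dictionary and with a slightly perturbed copy, then compares supports. The ISTA solver, the 1/2 scaling of the squared loss, the random perturbation, and the particular values of $\lambda$ and $\varepsilon$ are all choices made for this example; $\varepsilon$ is simply taken small rather than computed from (7).

```python
import numpy as np


def lasso_encode(D, x, lam, n_iter=5000):
    """ISTA for 0.5 * ||x - D z||_2^2 + lam * ||z||_1 (assumed scaling)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = z - step * (D.T @ (D @ z - x))
        z = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return z


def perturb(D, eps, seed=0):
    """Return a dictionary at operator-norm distance eps from D.

    Columns are not re-projected onto the unit ball in this sketch.
    """
    rng = np.random.default_rng(seed)
    E = rng.standard_normal(D.shape)
    return D + eps * E / np.linalg.norm(E, 2)


D = np.eye(3)                      # orthonormal atoms, so mu_s(D) = 1
x = np.array([0.8, 0.05, -0.5])    # a point inside the unit ball
lam, eps = 0.2, 1e-3

z = lasso_encode(D, x, lam)                       # code under D
z_tilde = lasso_encode(perturb(D, eps), x, lam)   # code under the perturbed D

same_support = np.array_equal(z != 0, z_tilde != 0)
code_shift = np.linalg.norm(z - z_tilde)
print(same_support, code_shift)
```

In this example the unused atom has correlation 0.05 with the residual, well below $\lambda = 0.2$, and the smallest nonzero coefficient has magnitude 0.3, so a perturbation of size $10^{-3}$ leaves the support unchanged and moves the code by an amount of the same order as the perturbation.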

Proof sketch: Our strategy is to show that there is a unique solution to the perturbed problem, defined in
terms of the optimality conditions of the LASSO (see conditions L1 and L2 of (Asif and Romberg, 2010)),
and this solution has the same support as the solution to the original problem. As a result, the perturbed
solution's proximity to the original solution is governed in part by a condition number $\mu$ of a linear system
of $s$ variables. ■



3.2 Symmetrization by ghost sample for random subclasses 

The next result is essentially due to Mendelson and Philips (2004); it applies symmetrization by a ghost
sample for random subclasses. Our main departure is that we allow the random subclass to depend on a
second, unlabeled sample $x''$. The lemma will be applied to the overcomplete setting in the simpler form
without $x''$ and to the infinite-dimensional setting in its full form.

Lemma 1 (Symmetrization by Ghost Sample). Let $\mathcal{F}(z, x'') \subset \mathcal{F}$ be a random subclass which can
depend on both a labeled sample $z$ and an unlabeled sample $x''$.
If $m \ge (b/t)^2$ then

$\Pr_{z x''}\{\exists f \in \mathcal{F}(z, x''),\ P f_\phi - P_z f_\phi \ge t\} \le 2 \Pr_{z z' x''}\{\exists f \in \mathcal{F}(z, x''),\ P_{z'} f_\phi - P_z f_\phi \ge t/2\}$.



4 Overcomplete setting 

Classically, the overcomplete setting is the modus operandi in sparse coding. Since k > d in this setting, an 
ideal learning bound has minimal dependence on k. We derive learning bounds with a square root dependence 
on $k$, below which the possibility of further improvement is open even in the reconstructive setting^1. At
a high level, our strategy for the overcomplete case learning bound is to construct an epsilon-cover over a
subclass of the space of functions $\mathcal{F} := \{f = (D, w) : D \in \mathcal{D}, w \in \mathcal{W}\}$ and to show that the metric entropy
of this subclass is of order $dk$. The main difficulty is that an epsilon-cover over $\mathcal{D}$ need not approximate $\mathcal{F}$
to any degree, unless one has a notion of encoder stability (which we describe in this section). Our analysis
effectively will be concerned only with a training sample and a ghost sample; hence, similar to the style of
the luckiness framework of Shawe-Taylor et al. (1998), should we observe that the sufficient conditions for
encoder stability hold true on the training sample, it is enough to guarantee that most points in a ghost 
sample also satisfy these conditions (possibly at a weaker level). 

4.1 Useful conditions and subclasses 

We first establish two important conditions. Letting $f = (D, w) \in \mathcal{D} \times \mathcal{W}$, define

$A_s(f, x) := \{\max_{x_i \in x} \|\varphi_D(x_i)\|_0 \le s\}$ and $C_\tau(f, x) := \{\max_{x_i \in x} \max_{j \in [k]} \psi_j(D, x_i) \le \lambda - \tau\}$,

where $\psi_j(D, x) := |\langle D_j, x - D\varphi_D(x) \rangle| - |(\varphi_D(x))_j|$. The first condition is critical as the learning bound
will exploit the sparsity level over the training sample; the second condition can be motivated as follows.
Fix some $x \in B_{\mathbb{R}^d}$ and assume $C_\tau(f, x)$ is true. If $j \in \Lambda$, the LASSO optimality condition L1 of Asif and
Romberg (2010) implies that $|\langle D_j, x - D\varphi_D(x) \rangle| = \lambda$; since $\psi_j(D, x) \le \lambda - \tau$ is true, it follows that condition (5)
holds. If $j \notin \Lambda$, then since $(\varphi_D(x))_j = 0$ the condition $\psi_j(D, x) \le \lambda - \tau$ is precisely the condition (6). Hence, the
condition $\psi_j(D, x) \le \lambda - \tau$ is the key encoder-stability inequality which will need to be enforced for each
coordinate. Since the form of $\psi_j$ is independent of $j$, all coordinates can be treated identically independent
of their membership in $\Lambda$.

^1 The only works known to attack the reconstructive setting are Maurer and Pontil (2008) and Vainsencher et al. (2011).

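For notational compactness, define $\Gamma_{s,\tau}(f, x) := A_s(f, x) \wedge C_\tau(f, x)$ and

$\Gamma_{\eta,s,\tau}(f, x) := \left(\nexists\, \tilde{x} \subseteq x : |\tilde{x}| > \eta \wedge \forall x_i \in \tilde{x},\ \overline{A_s(f, x_i)} \vee \overline{C_\tau(f, x_i)}\right)$,

where $\overline{E}$ is the negation of a boolean expression $E$.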

Define the level-$\alpha$ coding margin $\tau_\alpha(D, x) := \max\{\tau' \ge 0 : C_{\alpha\tau'}(D, x)\}$ for $\alpha > 0$ and the coding margin
$\tau(D, x) := \tau_1(D, x)$; clearly $\tau_\alpha(D, x) = \alpha^{-1}\tau(D, x)$. For convenience we often make the first argument $f$
rather than $D$, using the overloaded definition $\tau(f, x) := \tau(D, x)$ where $f = (D, w)$. Our bounds will require
a crucial PRP-based condition that depends on both the learned dictionary and the training sample:

$\tau(D, x) \ge \iota(\lambda, \mu, \varepsilon, s)$ for $\iota(\lambda, \mu, \varepsilon, s) := 3\varepsilon\,([\ldots])$.

For brevity we will refer to $\iota$ with its parameters implicit; the dependence on $\varepsilon$, $s$, $\lambda$, and $\mu$ will be a nonissue
because we first develop bounds with all these quantities fixed a priori.
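
Given codes computed by any LASSO solver, the quantities $\psi_j$ and the coding margin can be read off directly. The following Python sketch assumes the non-strict form of $C_\tau$, so that the coding margin over a sample equals $\lambda - \max_{i,j} \psi_j(D, x_i)$, and assumes codes that satisfy the optimality condition $|\langle D_j, x - D\varphi_D(x) \rangle| = \lambda$ on the support. The function name and the column-wise data layout are hypothetical.

```python
import numpy as np


def coding_margin(D, X, Z, lam):
    """Coding margin of a sample: lam - max over points i and atoms j of
    psi_j(D, x_i) = |<D_j, x_i - D z_i>| - |(z_i)_j|.

    D: d x k dictionary; X: d x m points as columns; Z: k x m codes as columns.
    A negative value indicates that no nonnegative margin satisfies the
    condition, which can happen when Z holds inexact codes.
    """
    residual = X - D @ Z
    psi = np.abs(D.T @ residual) - np.abs(Z)
    return lam - psi.max()


# Usage with an identity dictionary, where the code is a soft-thresholding of x
# (0.5-scaled squared loss assumed, threshold lam).
D = np.eye(3)
lam = 0.2
X = np.array([[0.8], [0.05], [-0.5]])
Z = np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)
tau = coding_margin(D, X, Z, lam)
```

Here the unused atom contributes $\psi_j = 0.05$ and the used atoms contribute $\lambda - |z_j|$, so the margin is limited by the unused atom's correlation with the residual rather than by the magnitude of the nonzero coefficients.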

Finally, we will use the subclasses $\mathcal{D}_\mu := \{D \in \mathcal{D} : \mu_s(D) \ge \mu\}$ and $\mathcal{F}_\mu := \{f = (D, w) \in \mathcal{F} : D \in \mathcal{D}_\mu\}$,
where $\mu > 0$ is fixed.

4.2 Learning bound 

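The following proposition is a specialization of Lemma 1 with $x''$ taken as the empty set and the random
subclass defined as $\mathcal{F}(z, x'') := \{f \in \mathcal{F}_\mu : \Gamma_{s,\tau}(f, x)\}$.

Proposition 1. If $m \ge (b/t)^2$ then

$\Pr_z\{\exists f \in \mathcal{F}_\mu,\ \Gamma_{s,\tau}(f, x) \wedge ((P - P_z) f_\phi \ge t)\}$
$\le 2 \Pr_{z z'}\{\exists f \in \mathcal{F}_\mu,\ \Gamma_{s,\tau}(f, x) \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$,

where the event inside the probability on the right-hand side is denoted $J$. ■

Define $Z$ as the event that there exists a hypothesis with stable codings on the original sample but more
than $\eta(m, d, k, \varepsilon, \delta)$ points with unstable codings on the ghost sample:

$Z := \{z z' : \exists f \in \mathcal{F}_\mu,\ \Gamma_{s,\tau}(f, x) \wedge (\exists \tilde{x} \subseteq x',\ |\tilde{x}| > \eta(m, d, k, \varepsilon, \delta) \wedge \forall \tilde{x}_i \in \tilde{x},\ \overline{\Gamma_{s,\tau_3(f,x)}(f, \tilde{x}_i)})\}$.

We will show that $\Pr(J)$ is small by use of the fact that

$\Pr(J) = \Pr(J \cap \bar{Z}) + \Pr(J \cap Z) \le \Pr(J \cap \bar{Z}) + \Pr(Z)$.

So far, our strategy is similar to the beginning of Shawe-Taylor et al.'s proof of the main luckiness framework
learning bound (see Shawe-Taylor et al., 1998, Theorem 5.22). We show that $\Pr(Z)$ and $\Pr(J \cap \bar{Z})$ are small
in turn and then present the main learning bound.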

The next lemma shadows Shawe-Taylor et al. (1998)'s notion of probable smoothness and vanquishes the
key difficulty in bounding Pr(Z). 
Lemma 2 (Good Ghost). Fix $\lambda > 0$ and $s \in [k]$. Let $x \sim P^m$ and $x' \sim P^m$ be iid samples of $m$ points
each. Choose any $D \in \mathcal{D}_\mu$ for which $A_s(D, x)$. With probability at least $1 - \delta$ at least $m - \eta(m, d, k, \varepsilon, \delta)$
points $\tilde{x} \subseteq x'$ satisfy $A_s(D, \tilde{x})$ and $C_{\tau_3(D,x)}(D, \tilde{x})$, for

$\eta(m, d, k, \varepsilon, \delta) := 2dk\log\frac{96s}{[\ldots]} + \log(2m + 1) + 2\log\frac{2}{\delta}$.

Proof sketch: The core idea of the proof is to show that if $D$ satisfies encoder stability at level $s$ and coding
margin $\tau$, then there exists a representative $D'$ in an $\varepsilon$-cover of $\mathcal{D}_\mu$ (composed of balls with radius linear
in $\tau$) which satisfies encoder stability at level $s$ and a coding margin reduced by a constant factor. Next,
a standard permutation argument (Vapnik and Chervonenkis, 1968, Proof of Theorem 2), along with a VC
dimension argument to entertain all choices of the coding margin $\tau$, guarantees with high probability that
the representative $D'$ also satisfies the same stability properties on most of the ghost sample. Finally, we
once again jump from $D'$ to $D$ to guarantee that, on this good part of the ghost sample where $D'$ satisfies
s-sparsity and slightly reduced coding margin, $D$ also satisfies s-sparsity and some further reduced coding
margin. ■

It remains to bound $\Pr(J \cap \bar{Z})$. We use the shorthand $\eta := \eta(m, d, k, \varepsilon, \delta)$ for conciseness.

Lemma 3 (Large Deviation on Good Ghost). Let $\varpi := t/2 - (2L\beta + b\eta/m)$ and $\beta := \varepsilon\,([\ldots])$.
Then

$\Pr(J \cap \bar{Z}) \le [\ldots]\, \exp(-m\varpi^2/(2b^2))$.

Equivalently, the difference between the loss on $z$ and the loss on $z'$ is greater than $\varpi + 2L\beta + b\eta/m$ with
probability at most $[\ldots]\, \exp(-m\varpi^2/(2b^2))$.

Proof sketch: From the definition of $J \cap \bar{Z}$, the entire training sample and all but $\eta$ points of the ghost
sample are "good" (i.e. have stable codings). Thus, there exists a representative $f'$ in a product of $\varepsilon$-covers
of $\mathcal{D}$ and $\mathcal{W}$ which closely approximates (with error at most $2L\beta$, from the Sparse Coding Stability Theorem
(Theorem 1)) the function evaluation of all good points and suffers error at most $b\eta/m$ on the bad points.
Hence, for there to exist a large deviation $t'$ between the training and ghost sample, there would need to
be a deviation of at least $t' - (2L\beta + b\eta/m)$ for at least one hypothesis in the product of $\varepsilon$-covers. The result
follows from Hoeffding's inequality and the union bound. ■

A learning bound is within reach. Let $\varepsilon = [\ldots]$; then from some elementary manipulations:

Theorem 2. If $m \ge [\ldots]$, then for all $f \in \mathcal{F}$ such that $\mu_s(f) \ge \mu$, $A_s(f, x)$, and $C_\tau(f, x)$, with
probability at least $1 - \delta$ over $z$:

$(P - P_z) f_\phi \le 2b\sqrt{\frac{2\left((d+1)k\log(8m) + k\log[\ldots] + \log[\ldots]\right)}{m}} + [\ldots]$.
m 



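Now, we construct a bound for each choice of $s$ and $\mu$. To each choice of $s \in [k]$ assign prior probability
$1/k$. To each choice of $i \in \mathbb{N} \cup \{0\}$ for $2^{-i} \le \mu$ assign prior probability $\frac{6}{\pi^2}(i + 1)^{-2}$. For a given choice of $s \in [k]$
and $2^{-i} \le \mu$ we use $\delta(s, i) := \frac{6}{\pi^2} \frac{1}{(i+1)^2} \frac{1}{k} \delta$. Note that $\sum_{s=1}^k \sum_{i=0}^\infty \delta(s, i) = \delta$ since $\sum_{i=1}^\infty \frac{1}{i^2} = \frac{\pi^2}{6}$.

Define $\nu(\mu) := \min\{1, 2^{\lfloor \log_2 \mu \rfloor}\}$ for $\mu > 0$. The following corollary is now proved.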
Corollary 1 (Overcomplete Learning Bound). For any $s \in [k]$, for all $f \in \mathcal{F}$ such that $A_s(f, x)$,
$C_\tau(f, x)$, and $m \ge [\ldots]$, with probability at least $1 - \delta$ over $z$:

$(P - P_z) f_\phi \le 2b\sqrt{\frac{2\left((d+1)k\log(8m) + k\log[\ldots] + \log[\ldots]\right)}{m}} + 4L\,[\ldots] + \frac{2b}{m}\left(2dk\log\frac{96s}{[\ldots]} + \log(2m+1) + 2\log[\ldots]\right)$.




5 Infinite-dimensional setting 

Often in learning problems, we first map the data implicitly to a space of very high dimension or even 
infinite dimension and use kernels for efficient computations. In these cases where $d \gg k$ or $d$ is infinite,
it is unacceptable for any learning bound to exhibit dependence on $d$. Unfortunately, the strategy of the
previous section breaks down in the infinite-dimensional setting because the straightforward construction of
any epsilon-cover over the space of dictionaries has cardinality that depends on $d$. Even worse, epsilon-covers
actually were used both to approximate the function class $\mathcal{F}$ in $\|\cdot\|_\infty$ norm and to guarantee that most
points of the ghost sample are good provided that all points of the training sample were good (the Good
Ghost Lemma (Lemma 2)).

These issues can be overcome by requiring an additional, unlabeled sample (a device often justified in
supervised learning problems because unlabeled data may be inexpensive and yet quite helpful) and by
switching to more sophisticated techniques based on conditional Rademacher and Gaussian averages. After
learning a hypothesis $f$ from a predictive sparse coding algorithm, the sparsity level and coding margin are
measured on a second, unlabeled sample $x''$ of $m$ points^2. Since this sample is independent of the choice of
$f$, it is possible to guarantee that all but a very small fraction $\eta/m$ of points of a ghost sample $z'$ are
good with probability $1 - \delta$. In the likely case of this good event, and for a fixed sample, we then consider all
possible choices of a set of $\eta$ bad indices in the ghost sample; each of the $\binom{m}{\eta}$ cases corresponds to a subclass
of functions. We then approximate each subclass by a special $\varepsilon$-cover that is a disjoint union of a finite
number of special subclasses; for each of these smaller subclasses, we bound the conditional Rademacher
average by exploiting a sparsity property.

5.1 Symmetrization and decomposition 

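Let $\mu^* = (\mu^*_s, \mu^*_{2s}) \in \mathbb{R}^2_+$ and define $\mathcal{F}_{\mu^*} := \{f \in \mathcal{F} : (\mu_s(f) \ge \mu^*_s) \wedge (\mu_{2s}(f) \ge \mu^*_{2s})\}$. The next result is immediate from
Lemma 1 where $\mathcal{F}(z, x'') := \{f\} \cap \{f \in \mathcal{F}_{\mu^*} : \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'')\}$.

Proposition 2. If $m \ge (b/t)^2$, then

$\Pr_{z x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'') \wedge (P - P_z) f_\phi \ge t\}$ (8)
$\le 2 \Pr_{z z' x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'') \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$. ■

^2 The cardinality matches the size of the training sample $z$ purely for simplicity.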
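Now, observe that the probability of interest can be split into the probability of a large deviation
happening under a "good" event and the probability of a "bad" event occurring:

$\Pr_{z z' x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'') \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$

$= \Pr_{z z' x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'') \wedge \Gamma_{\eta,s,\tau}(f, x') \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$

$+ \Pr_{z z' x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'') \wedge \overline{\Gamma_{\eta,s,\tau}(f, x')} \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$

$\le \Pr_{z z'}\{\exists f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{\eta,s,\tau}(f, x') \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$

$+ \Pr_{x' x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x'') \wedge \overline{\Gamma_{\eta,s,\tau}(f, x')}\}$.

We treat the first probability in the next subsection. To bound the second probability, note that for each
choice of $z$, $f$ is a fixed function. Hence, it is sufficient to select $\eta$ such that, for any fixed function $f \in \mathcal{F}$,

$\Pr_{x' x''}\{\Gamma_{s,\tau}(f, x'') \wedge \overline{\Gamma_{\eta,s,\tau}(f, x')}\} \le \delta$.

Lemma 4 (Unlikely Bad Ghost). Let $f \in \mathcal{F}$ be fixed. If $\eta = 2\log[\ldots]$, then

$\Pr_{x' x''}\{\Gamma_{s,\tau}(f, x'') \wedge \overline{\Gamma_{\eta,s,\tau}(f, x')}\} \le \delta$.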

Proof sketch: The proof just uses the same standard permutation argument from the proof of the Good 
Ghost Lemma (Lemma 2). ■ 

5.2 Rademacher bound in the case of the good event 

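We now bound the probability of a large deviation in the (likely) case of the good event. For
readability, let $\mathcal{F}_{\mu^*,\Gamma}(x)$ be shorthand for $\{f \in \mathcal{F}_{\mu^*} : \Gamma_{s,\tau}(f, x)\}$. Likewise, let $\mathcal{F}_{\mu^*,\Gamma_\eta}(x)$ be shorthand for
$\{f \in \mathcal{F}_{\mu^*} : \Gamma_{\eta,s,\tau}(f, x)\}$. Let $\sigma_1, \ldots, \sigma_m$ be independent Rademacher random variables distributed uniformly
on $\{-1, 1\}$.

Lemma 5 (Symmetrization by Random Signs).

$\Pr_{z z'}\{\exists f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{\eta,s,\tau}(f, x') \wedge ((P_{z'} - P_z) f_\phi \ge t/2)\}$

$\le \Pr_{z,\sigma}\{\sup_{f \in \mathcal{F}_{\mu^*,\Gamma}(x)} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge t/4\} + \Pr_{z,\sigma}\{\sup_{f \in \mathcal{F}_{\mu^*,\Gamma_\eta}(x)} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge t/4\}$.

Proof sketch: The proof uses a standard application of symmetrization by random signs. ■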

Note that $\mathcal{F}_{\mu^*,\Gamma}(x)$ is just $\mathcal{F}_{\mu^*,\Gamma_0}(x)$, and so we need only bound the second term of the last line above
for arbitrary $\eta \in [m]$. Before bounding this quantity, which for fixed $x$ is a conditional Rademacher average
(see Appendix D.1 for the fundamentals of Rademacher and Gaussian averages), we need to establish a few
results on the Gaussian average of a related function class.

First, note that for any $D \in \mathcal{D}$, $D$ can be decomposed into a product of two matrices as $D = US$, where all
$U \in \mathcal{U} \subset \mathbb{R}^{d \times k}$ satisfy the isometry property $U^T U = I$ and $S \in \mathcal{S} := (B_{\mathbb{R}^k})^k$ (Maurer and Pontil, 2008). Fix
$S$, a linear hypothesis $w \in \mathcal{W}$, and a sample $x$ of $m$ points. The subclass of interest will be those functions
corresponding to $U \in \mathcal{U}$ such that $\|\varphi_{US}(x_i)\|_0 \le s$ for each point $x_i$ of the sample. It turns out that the Gaussian average of this subclass is
well-behaved.
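
One way to obtain such a factorization numerically, when $d \ge k$, is a reduced QR decomposition. The Python sketch below is only an illustration (the function name is hypothetical, and Maurer and Pontil (2008) do not prescribe this construction): the orthonormal factor plays the role of $U$, and because $U$ preserves norms, each column of $S$ has the same norm as the corresponding atom of $D$ and therefore lies in the unit ball of $\mathbb{R}^k$.

```python
import numpy as np


def isometry_factorization(D):
    """Factor D (d x k, with d >= k) as D = U @ S, where U has orthonormal
    columns (U.T @ U = I_k) and S is k x k, via the reduced QR decomposition."""
    U, S = np.linalg.qr(D)  # default mode="reduced": U is d x k, S is k x k
    return U, S


# Usage: a random dictionary whose atoms are rescaled into the unit ball.
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 3))
D /= np.maximum(1.0, np.linalg.norm(D, axis=0))
U, S = isometry_factorization(D)
```

All dependence on the ambient dimension $d$ sits in $U$, while $S$ lives in a $k \times k$ space, which is what makes a dimension-free treatment of $S$ plausible.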
Theorem 3 (Gaussian Average for Fixed S and w). Let $S \in \mathcal{S}$, $s \in [k]$, and $x$ be a fixed m-sample.
Define a particular subclass of $\mathcal{U}$ as $\{U \in \mathcal{U} : A_s(US, x)\}$. Then, with the supremum taken over this subclass,

$E_\gamma \sup_U \frac{2}{m} \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle \le \frac{4rk\sqrt{2s}}{\mu_{2s}(S)\sqrt{m}}$.

The proof of this result uses the following lemma that shows how the difference between the feature maps
$\varphi_{US}$ and $\varphi_{U'S}$ can be characterized by the difference between $U$ and $U'$. Define the s-restricted 2-norm of
$S$ as $\|S\|_{2,s} := \sup_{\{t \in \mathbb{R}^k : \|t\|_2 = 1, |\mathrm{supp}(t)| \le s\}} \|St\|_2$.

Lemma 6 (Difference Bound For Isometries). Let $x \in B_{\mathbb{R}^d}$. If $\|\varphi_{US}(x)\|_0 \le s$ and $\|\varphi_{U'S}(x)\|_0 \le s$,
then

$\|\varphi_{US}(x) - \varphi_{U'S}(x)\|_2 \le \frac{[\ldots]\,\|S\|_{2,2s}}{\mu_{2s}(S)} \|(U'^T - U^T)x\|_2$.

Proof sketch: The proof uses a perturbation analysis of solutions to linearly constrained positive definite
quadratic programs (Daniel, 1973), exploiting the sparsity of the optimal solutions to have dependence only
on $\|S\|_{2,2s}$ and $\mu_{2s}(S)$ rather than $\|S\|_2$ and $\mu_k(S)$. ■

Proof sketch (of Theorem 3): The proof controls the Gaussian average of the function class of interest, which is
nonlinear in $U$, by the Gaussian average of a second function class that is linear in $U$. Lemma 6 is critical
in establishing a link between the variances of the Gaussian process induced by each function class, after
which Slepian's Lemma (Lemma 9) relates the corresponding Gaussian averages. By exploiting the isometry
property of $\mathcal{U}$, the Gaussian average of the second function class can admit a dimension-free bound. ■

The palette for the pivotal art of this section is established. 

Theorem 4 (Rademacher Average of Mostly Good Random Subclasses). 

\l/(/c+l)\ 



for t,:^'--u(^i4 + l) + l^ 



Proof sketch: We decompose the space of dictionaries $\mathcal{D}$ via $\mathcal{U}$ and $\mathcal{S}$ and split this function class into $N := \binom{m}{\eta}$
subclasses, each of which has the property that all functions in the class have a common set of $m - \eta$ good
indices. We then approximate each subclass by a product of $\varepsilon$-covers of $\mathcal{W}$ and (a suitably incoherent subset
of) $\mathcal{S}$. The error from this approximation is low due to the Sparse Coding Stability Theorem (Theorem 1).
Fixing a subclass and a representative from each of the two $\varepsilon$-covers induces a smaller subclass, and there is
an index set of at least $m - \eta$ points that are sparsely coded for all encoders in this smaller subclass. Theorem
3 then essentially implies a bound on the Rademacher complexity of each smaller subclass. The union bound
over $[N]$ and the two $\varepsilon$-covers finishes the result. ■



For the case of rj = 0, define 

Since J-^r,x is equivalent to J-To.x, Lemma 5 and Theorem 4 imply that 
Przz' {^f e J-^. n,,(f , x) A r^,sAf. x') A (Pz' - Pz > ^) } 

^ [^^e ) ^M-mti/i2b')) + Q (^^^^ j eM-mtl/i2b')) 

<2( JfAA^^ j exp{-mtl/{2b% 
Finally, the full probability (8) can be upper bounded (using $\eta = 2\log[\ldots]$) as:

$\Pr_{z x''}\{f \in \mathcal{F}_{\mu^*},\ \Gamma_{s,\tau}(f, x) \wedge \Gamma_{s,\tau}(f, x'') \wedge (P - P_z) f_\phi \ge t\} \le [\ldots]$.



After some elementary manipulations and choosing e = ^, we nearly have the final learning bound. Let 
L'(A,^t,m,s) :=l(±±^ + i + 

Theorem 5. Let $\mu^*_s, \mu^*_{2s} > 0$, $s \in [k]$, and $\tau \ge \iota'(\lambda, \mu^*_s, m, s)$. Suppose an algorithm is trained on a labeled
sample $z$ of $m$ points and learns hypothesis $f$. Let $x''$ be a second, unlabeled sample of $m$ points. Suppose
that $\mu_{2s}(f) \ge \mu^*_{2s}$, $\mu_s(f) \ge \mu^*_s$, $A_s(f, x)$, $A_s(f, x'')$, $C_\tau(f, x)$, and $C_\tau(f, x'')$ all hold. Then with probability at
least $1 - \delta$ over $z$ and $x''$:



(p _ p < ^LV^rkV^ , / 2 ((^ + log(8m) + /clog § + (2 log m + 3) log f ] 



\ 1\ 4fa, 16 

1 + T + — l°g ■ 



m\|j.*\A / Ay m 5 

A bound adaptive to the sparsity level and coding margin on z and x" is now immediate. Recall that 
= min{l, 2Li°g2 ^^J } for ^ > 0. 

Corollary 2 (Infinite-Dimensional Learning Bound). Suppose an algorithm is trained on a labeled 
sample z of m points and learns hypothesis f. Let x" be a second, unlabeled sample of m points. Choose 

s = max |s' e [k] : 4s'(f , x) A /\s'(f , x")| and t= min |t' e M+ : ^-(^x) A Cr'{f.x")^ . 

Let \ls '■= y{[is{ f) and jlas ■v(M-2s(0- If'^^ l'(A, |J.s(0, m, s) and H2s(0 > 0; then with probability at least
$1 - \delta$ over z and x":



ILJnrkJs 



2 ( (/c2 + /c) log(8m) + /clog § + (2 log m + 3) log 44(iofo(H.) iog,(fe))^<c 
A /^/^ I I n I 4b^^^ 44(log2(fi,)log2(fes)f /c



6 Discussion 

We have shown the first generalization error bounds for predictive sparse coding. The results highlight the 
central role of the stability of the sparse encoder. The presented bounds are data-dependent and exploit 
properties relating to the training sample and the learned hypothesis. A learning bound appropriate for the 
overcomplete setting exhibits square root dependence on both the ambient dimension d and the size of the 
dictionary k. In the infinite-dimensional setting, a dimension-free learning bound has linear dependence on
k, square root dependence on s, and inverse dependence on the 2s-incoherence. 

Maurer and Pontil (2008) previously showed the following generalization bound for unsupervised ($\ell_1$-regularized)
sparse coding:

pJsupPfo-Pxfo > ^f? + ^v^^^(IWA^V\/^¥^Us- 
yo^v \/in \ X 2 / V 2m J 
where $f_D(x) := \min_{z \in \mathbb{R}^k} \|x - Dz\|_2^2 + \lambda \|z\|_1$. Comparing their result to Corollary 2 and neglecting regularization
parameters, the dimension-free bound in the predictive case is larger by a factor of $\sqrt{s}$. It is unclear whether
the $\sqrt{s}$ term is avoidable in the predictive setting. At least from our analysis, it appears that the $\sqrt{s}$ factor
is the price of stability.

The data-dependent stability conditions under which our bounds hold may seem restrictive. In the case 
where the coding margin over the training sample is not uniformly large, or the sparsity level over the 
training sample is not uniformly small, bounds based on unlikely large deviations from the expectations 
of these quantities can be established. We did not pursue such bounds here because in predictive sparse 
coding all training samples empirically do have a low sparsity level and high s-incoherence. Also, the analysis 
becomes considerably more involved but we anticipate the learning bounds to be of similar order. 

If one entertains a mixture of $\ell_1$ and $\ell_2$ norm regularization as in the elastic net (Zou and Hastie, 2005),
fall-back guarantees are possible in both scenarios. A considerably simpler, data-independent analysis is
possible in the overcomplete setting with a final bound that essentially just trades $\mu_s(D)$ for the $\ell_2$-norm
regularization parameter $\lambda_2$. In the infinite-dimensional setting, a simpler non-data-dependent analysis
using our approach would only attain a bound of the larger order -^^^ ■ In conclusion, the presented data- 
dependent bounds provide motivation for an algorithm to prefer dictionaries for which small subdictionaries 
are well-conditioned and to additionally encourage large coding margin on the training sample. 
A Proof of Sparse Coding Stability Theorem

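Proof (of Theorem 1): Let $\alpha$ be equal to $(\varphi_D(x))_\Lambda$, and define the scaled sign vector $\zeta := \lambda\,\mathrm{sign}(\alpha)$. Our
strategy will be to show that, for some $\Delta \in \mathbb{R}^s$, the optimal perturbed solution $\varphi_{\tilde D}(x)$ satisfies $(\varphi_{\tilde D}(x))_\Lambda =
\alpha + \Delta$ and $(\varphi_{\tilde D}(x))_{\Lambda^c} = 0$, where $\Lambda^c := [k] \setminus \Lambda$.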

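From the optimality conditions for the LASSO (e.g. see optimality conditions L1 and L2 of Asif and
Romberg (2010)), it is sufficient to find $\Delta$ such that

$$\langle \tilde D_j, x - \tilde D_\Lambda(\alpha + \Delta) \rangle = \zeta_j \quad \text{if } j \in \Lambda,$$

$$|\langle \tilde D_j, x - \tilde D_\Lambda(\alpha + \Delta) \rangle| < \lambda \quad \text{otherwise.}$$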
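As a small sanity check on the two optimality conditions above, the following sketch verifies them numerically in the simplest possible case. It assumes the LASSO objective is scaled as $\frac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1$ (the scaling under which the conditions take the stated form) and uses an identity dictionary, for which the code is given by soft thresholding. The helper names `soft_threshold` and `kkt_holds` are illustrative and do not come from the paper; the off-support check uses a non-strict inequality with a tolerance.

```python
def soft_threshold(v, lam):
    """Closed-form LASSO code when the dictionary is the identity."""
    return [(abs(c) - lam) * (1.0 if c > 0 else -1.0) if abs(c) > lam else 0.0
            for c in v]


def kkt_holds(atoms, x, z, lam, tol=1e-9):
    """Check the two LASSO optimality conditions for a candidate code z.

    atoms: list of dictionary atoms (columns), each a list of floats.
    Assumed objective: 0.5 * ||x - D z||_2^2 + lam * ||z||_1.
    """
    # Residual x - D z.
    residual = list(x)
    for atom, coef in zip(atoms, z):
        for i, a in enumerate(atom):
            residual[i] -= coef * a
    for atom, coef in zip(atoms, z):
        corr = sum(a * r for a, r in zip(atom, residual))
        if coef != 0.0:
            # On the support: correlation equals lam * sign(z_j).
            sign = 1.0 if coef > 0 else -1.0
            if abs(corr - lam * sign) > tol:
                return False
        elif abs(corr) > lam + tol:
            # Off the support: correlation is at most lam in magnitude.
            return False
    return True


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [0.9, -0.2, 0.5]
lam = 0.3
z = soft_threshold(x, lam)
print(z, kkt_holds(identity, x, z, lam))
```

Perturbing a support coordinate of `z` breaks the first condition, which is the kind of violation the proof rules out by solving for $\Delta$ explicitly.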



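We proceed by setting up the linear system and characterizing the solution vector $\Delta$:

$$\tilde D_\Lambda^T (x - \tilde D_\Lambda(\alpha + \Delta)) = \zeta \quad \overset{\text{solve for } \Delta}{\Longrightarrow} \quad \Delta = (\tilde D_\Lambda^T \tilde D_\Lambda)^{-1} \left( \tilde D_\Lambda^T (x - \tilde D_\Lambda \alpha) - \zeta \right).$$

Since $\tilde D = D + E$ for $\|E\|_2 \leq \varepsilon$,

$$\begin{aligned}
\tilde D_\Lambda^T (x - \tilde D_\Lambda \alpha) &= (D_\Lambda + E_\Lambda)^T (x - (D_\Lambda + E_\Lambda)\alpha) \\
&= D_\Lambda^T (x - D_\Lambda \alpha) - D_\Lambda^T E_\Lambda \alpha + E_\Lambda^T (x - (D_\Lambda + E_\Lambda)\alpha) \\
&= \zeta - D_\Lambda^T E_\Lambda \alpha + E_\Lambda^T (x - (D_\Lambda + E_\Lambda)\alpha),
\end{aligned}$$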






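and so the solution for $\Delta$ can be reformulated as

$$\Delta = (\tilde D_\Lambda^T \tilde D_\Lambda)^{-1} \left( -D_\Lambda^T E_\Lambda \alpha + E_\Lambda^T (x - (D_\Lambda + E_\Lambda)\alpha) \right).$$

Now,

$$\|\Delta\|_2 \leq \|(\tilde D_\Lambda^T \tilde D_\Lambda)^{-1}\|_2 \left( \|D_\Lambda^T E_\Lambda \alpha\|_2 + \|E_\Lambda^T (x - (D_\Lambda + E_\Lambda)\alpha)\|_2 \right)$$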
|j. A 



+ 1 



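For $y \in \mathbb{R}^s$, let $y_{k \times 1}$ be the extension to $\mathbb{R}^k$ satisfying $(y_{k \times 1})_\Lambda = y$ and $(y_{k \times 1})_{\Lambda^c} = 0$. For $(\alpha + \Delta)_{k \times 1}$ to
be optimal for LASSO($\lambda$, $\tilde D$, $x$), $(\alpha + \Delta)_{k \times 1}$ must satisfy the two optimality conditions and $\Delta$ must be small
enough such that sign consistency holds between $\alpha$ and $(\alpha + \Delta)$ (i.e. $\mathrm{sign}(\alpha_j) = \mathrm{sign}(\alpha_j + \Delta_j)$ for all $j \in [s]$).

We first check the optimality conditions. The first optimality condition is equivalent to

$$\langle \tilde D_j, x - \tilde D_\Lambda(\alpha + \Delta) \rangle = \zeta_j \quad \text{for } j \in \Lambda;$$

this condition is satisfied by construction. The second optimality condition is equivalent to

$$|\langle \tilde D_j, x - \tilde D_\Lambda(\alpha + \Delta) \rangle| < \lambda \quad \text{for } j \notin \Lambda.$$

But for $j \notin \Lambda$,

$$\begin{aligned}
|\langle \tilde D_j, x - \tilde D_\Lambda(\alpha + \Delta) \rangle| &= |\langle D_j + E_j, x - (D_\Lambda + E_\Lambda)(\alpha + \Delta) \rangle| \\
&= |\langle D_j, x - D_\Lambda \alpha \rangle - \langle D_j, D_\Lambda \Delta \rangle + \langle E_j, x - (D_\Lambda + E_\Lambda)(\alpha + \Delta) \rangle - \langle D_j, E_\Lambda(\alpha + \Delta) \rangle|
\end{aligned}$$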



< A-T + 



+ 1 +£ + 



A - T+ £ 



1 



^ A 

and so this condition is satisfied provided that 



+ Vs 1 



1 < T. 



Now, we check sign consistency. Clearly sign consistency holds over $\Lambda^c$. It remains to check that it holds
over $\Lambda$. Observe that



Hence, sign consistency holds provided that 

|a,-| > £ 



All the above constraints are satisfied if x satisfies 



+ ^-M|<T. 
B Proof of Symmetrization by Ghost Sample Lemma



Proof (of Lemma 1): Replace $F(\sigma_n)$ from the notation of Mendelson and Philips (2004) with $\mathcal{F}(z, x'')$. A
modified one-sided version of (Mendelson and Philips, 2004, Lemma 2.2) that uses the more favorable
Chebyshev-Cantelli inequality implies that, for every $t > 0$:

4supfgjp- Var(cp o f) + mt-^ ) 



<Przz'x" {3f e J-(z,x") P,. - Pz > ^} 



Furthermore, because we assume a bounded loss function with losses trapped in $[0, b]$, it follows that
$\sup_{f \in \mathcal{F}} \mathrm{Var}(\phi \circ f) \leq \frac{b^2}{4}$. The lemma follows since the left hand term of the LHS of the above inequality
is at least $\frac{1}{2}$ whenever $m \geq \left(\frac{b}{t}\right)^2$. ■



C Proofs for overcomplete setting 

Proof (of the Good Ghost Lemma (Lemma 2)): By the assumptions of the lemma, we consider D satisfying 
M.s(D) > p. and /4s(D,x). 

We want to guarantee with high probability that for $D$ all but $\eta$ points of the ghost sample (1) are coded
at sparsity level s, (2) have encodings whose non-zero- valued coordinates have absolute value greater than 
T3(D, x), and (3) have encodings whose zero- valued coordinates have absolute correlation of the corresponding 
atom with the residual (i.e. |(D;,x/ — Dq)D(x/))|) less than A — T3(D, x). 

The latter two inequality conditions are the raisons d'etre of ij). Consider the class of threshold functions 



defined as 



^stab(^) 1 1: if maxjeW X) < A - t, 

''^ 1 0; otherwise. 



Since the VC dimension of the one-dimensional threshold functions is 1, it follows that the VC(J^p^'') = 1. 

Consider a minimum-cardinality (^tti^^J^^)" proper cover V of I?^. Let D' be a candidate ele- 
ment of V satisfying \\D — D'\\2 < t+^^^t^j— . Then the Sparse Coding Stability Theorem (Theorem 1) 
implies /4s(D',x) and Q^^jCd.x)^^', x). First, we ensure that most points from the ghost sample satisfy 
maxjg[/<] ^\>j{D' , •) < A - x^i2{D, x). By using the VC dimension of J-q^'° and the standard permutation argument
of (Vapnik and Chervonenkis, 1968, Proof of Theorem 2), it follows that for a single, fixed element of
$\mathcal{D}'$, with probability at least $1 - \delta$ at most $\log(2m + 1) + \log \frac{1}{\delta}$ points from a ghost sample will violate the
inequality. Hence, by the bound on the proper covering numbers provided by Proposition 3 (see Appendix
E), we can guarantee for all candidate members $D' \in \mathcal{D}'$ that with probability $1 - \frac{\delta}{2}$ at most

96s 2 

< dk log — — + log(2m + 1) + log - 

A|j,T(L',x) 

points from the ghost sample violate the inequality (in the above, we use the generally loose bound (o. < s 
and assume without loss of generality"^ that A < 1. 



¹ If $\lambda > 1$, then all sparse codes will equal the zero vector.
It also is necessary to guarantee with high probability that most points of the ghost sample are coded
sparsely. Since s is fixed, for a single element of $\mathcal{D}'$ we need only bound the probability that a single function



1; if ||(Pd(><)||o < s 
0; otherwise 



takes the value 1 for all points in the training sample but takes the value 0 for some number of points in
the ghost sample. Indeed a standard permutation argument shows that $\log \frac{2}{\delta}$ or more points of the ghost
sample will violate the sparsity condition with probability at most $\frac{\delta}{2}$.

Thus, for arbitrary $D' \in \mathcal{D}'$ satisfying the conditions of the lemma, with probability $1 - \delta$ at most

96s 2 
Ti(m, d, k, £, 5) 2dk log — — + log(2m + 1) + 2 log - 

points from the ghost sample violate /4s(D',x') or ^^^^(d x'). 

Now, consider the at least m — r[{m, d, k, e, 6) points in the ghost sample that satisfy both As{D\ •) and 
^T3/2(f,x)(^'' ')■ Since \\D' — D\\2 < s+l^'^'^)-'^ , the Sparse Coding Stability Theorem (Theorem 1) implies 



-T3/2(f,x)l.t> . Mice \\u - U\\2 sy:^^^^ 
that these points satisfy As{D, •) and Ct-3(d x)(^^. ')■ 

Proof (of Lemma 3): First, note that J n Z is a subset of 



R 



^feJ'^ n,,(f,x) 

A (^x C x', |x| > ri.Vx e X r,^r,if.x)if.x] 
A (Pz' - Pz > t/2) 



To see this, let bad ghost be shorthand for the condition of there being more than $\eta$ points in the ghost
sample that violate r^-^f^f ^^{f , •) (i.e., that are "bad"), let good ghost be true if and only if bad ghost is false,
and let P be shorthand for the condition Pz' f,\, — Pz /4> > t/2. Suppose that zz' E J Z. Hence, 3f E 
such that Ts,i{f, x) and P, and furthermore, $f E such that Ts^JJ , x) and bad ghost. Consider all f E 
such that TsL(f, x) and P. Suppose any one of these satisfies bad ghost. Then 3f E such that Ts^(f,x) 
and bad ghost, which implies that zz' ^ J n Z. Hence, if zz' e J n Z then 3f E such that Ts,SS , x), P ^ 
and good ghost. 

Now, choose a f = (D, w) E T^^ satisfying Ts L(f, x) and good ghost. Thus, if ||D— D'lla < £ and 
$\|w - w'\|_2 \leq \varepsilon$, then for all but $\eta$ of the points in the ghost sample (and for all points of the original sample)
we are guaranteed that

$$|\langle w, \varphi_D(x_i) \rangle - \langle w', \varphi_{D'}(x_i) \rangle| \leq |\langle w - w', \varphi_D(x_i) \rangle| + |\langle w', \varphi_D(x_i) - \varphi_{D'}(x_i) \rangle|$$

< ^ + f — (-^ + 1) (Sparse Coding Stability Theorem (Theorem 1)) 

For the rest of the ghost sample, there is a coarse guarantee that ||(Pd(><';) — 'Pd'(x,)||2 < \- Hence, on the 
original sample 



1 

- y] I4>(y/. (w, (Pd(x,))) - 4)(y/, (1/1/', (pD'(x;)))| < /.(3, 



m 

i=l 
where the loss $\phi$ is $L$-Lipschitz. On the ghost sample



m 

i=i 



1 

- ^|4)(y;, {w. cpd(x;))) - cl)(y;. (^', cpo'(x;)))| 

< A I ^ |(>^,cpo(x;))-(>^^cpo'(x;))|| + l5]|cl)(y;,(^,q,o(x;)))-4,(y;^ 

m I — ^ / m ^ — ^ ' 

\ / good / / bad 



^ ,„ ri .[/.(£ + 2r) 
< /.|3 + — mm ^ ' 
m 



m 

The subclass of interest is 



^(x,x') :={fG J-^: ^..(f.x) A (jx C xMx| > t], Vx e x r,,^3(f,,)(f , x)) } . 

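It is sufficient to consider the $\varepsilon$-neighborhood of $\mathcal{F}(x, x')$ defined such that if $f = (D, w) \in \mathcal{F}(x, x')$, then all
$f' = (D', w')$ satisfying $\|D - D'\|_2 \leq \varepsilon$ and $\|w - w'\|_2 \leq \varepsilon$ are in the $\varepsilon$-neighborhood.

Let $\mathcal{F}_\varepsilon = \mathcal{D}_\varepsilon \times \mathcal{W}_\varepsilon$, where $\mathcal{D}_\varepsilon$ is a minimum-cardinality proper $\varepsilon$-cover of $\mathcal{D}_\mu$ and $\mathcal{W}_\varepsilon$ is a minimum-cardinality
$\varepsilon$-cover of $\mathcal{W}$. Note that the $\varepsilon$-neighborhood will contain at least one element of $\mathcal{F}_\varepsilon$. Hence, it
is sufficient to prove the large deviation bound for all of $\mathcal{F}_\varepsilon$ and to then consider the maximum difference
between an element of $\mathcal{F}(x, x')$ and its closest representative in $\mathcal{F}_\varepsilon$ (which of course is $O(\varepsilon)$).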

Now, for each f = {D,w) G J"^ satisfying Ts,i{f,x) and good ghost, there is a D' e such that 
II D — D'||2 < e and a. w' £ satisfying \\w — w'\\2 < e; hence, the difference between the losses of f and f 
on the double sample will be at most 2/.|3 + 

Let V be the absolute deviation between the loss of f on the original sample versus its loss on the ghost 
sample. Then the absolute deviation between the loss of f ' — [D' , w') on the original sample and the loss of 
f on the ghost sample is at least 

Hence, if v > t/2, then the absolute deviation between the loss of f on the original sample and the loss of 
f on the ghost sample must be at least 

Let fo'.w' be the hypothesis corresponding to [D' ,w'). It therefore is sufficient to bound the probability 
that 

Pr„, jaf = (D', w') eV.xW, Pz' - Pz > t/2 - (^2/.(3 + ^) } ■ 

since R implies the above event. 
Now, let CD := t/2 - (^2/.|3 + 

We first handle the case of a fixed $f = (D', w') \in \mathcal{D}_\varepsilon \times \mathcal{W}_\varepsilon$. Just apply Hoeffding's inequality for the
random variable $\phi(y'_i, f(x'_i)) - \phi(y_i, f(x_i))$ (range in $[-b, b]$) to yield

Przz' {Pz' -Pzf^>a}< exp{-ma^/{2b^)). 

Apply Proposition 4 to extend this bound over all of $\mathcal{D}_\varepsilon \times \mathcal{W}_\varepsilon$ via the union bound, yielding

Przz' {3f = (D', 1^') e Pe X We Pz' f<i> - Pz > G)} 

<(^^ ) exp(-mQV(2fa2)). 
Thus,

Pr(J n Z) < f ^^^^ J exp(-mCDV(2ib2)). 
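The Hoeffding-plus-union-bound step used in this proof can be illustrated with a short numeric sketch. It assumes a finite class of size $N$ (passed as `log_n`, its natural logarithm), losses with range $b$, and the single-hypothesis tail $\exp(-m\varpi^2/(2b^2))$ quoted above; the function names and the example values are hypothetical and chosen only for illustration.

```python
import math


def union_bound_tail(log_n, m, b, w):
    """N * exp(-m w^2 / (2 b^2)), with N supplied as log_n = log N."""
    return math.exp(log_n - m * w * w / (2.0 * b * b))


def deviation_for_confidence(log_n, m, b, delta):
    """Solve N * exp(-m w^2 / (2 b^2)) = delta for the deviation w."""
    return b * math.sqrt(2.0 * (log_n + math.log(1.0 / delta)) / m)


# Hypothetical values: log-cardinality of the cover, sample size, loss bound.
log_n, m, b, delta = 1000.0, 10000, 1.0, 0.05
w = deviation_for_confidence(log_n, m, b, delta)
print(w, union_bound_tail(log_n, m, b, w))
```

Setting the tail equal to $\delta$ and inverting is the same manipulation the proof of Theorem 2 performs when it solves for the deviation in terms of the confidence level.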

Proof (of Theorem 2): Proposition 1 and Lemmas 2 and 3 imply that 

Prz {3f e n,,(f , x) A (P - Pz > t)} 

Equivalently, 

Pr, {af c ^„ n„(f . X) A (p - P, > 2 + 2tp + '^'"' t''' ''" ))} 



Now, expand (3 and r| and replace 5 with 6/4: 
Przjaf e , x) A P - Pz 

>2 f . + 2U f ^ + 1) + i) + ^P''*'°^X^ + '°^P- + ') + ^'°^ 

\ V M- A Ay m 

^2(«W?^)"""e.p(-™„V(2.^), + ^. 

Choose f = |^ 8(r/2)^^/'-'+^) ^ '^+^)^ exp(-mCDV(2b^)), yielding 

Przjaf e^^ n,t(^x) 

A Pf^-Pzf* >2fG) + 2/.£f-(^ + l) ^ 



2 



|x A A 

^(2°"^ log + logCS'" + 1) + 2(c/ + l)/c log 8(772]bm) + ^ + 2 log 2) 



<4.(«:Z?^)'""'exp(-.c=V(2t')). 

which is equivalent to 

A P - Pz > 2 (^£D + 2/.£ (^^(^ + 1) + X 



^ fa(2d/c log 5^^^ - 2(d + l)/clog f + 2/c log f + log(2m + 1) + ^ + 2 log 2) 



m 



<4-^^^^ exp(-mro7(262)). 
Let $\delta$ (a new variable, not related to the previous incarnation of $\delta$) be equal to the upper bound, and
solve for $\varpi$, yielding:



CD = fa^ 

and hence 



'2((a' + l)/clogf + /clog f + log l: 



A P - Pz > 2b\ 



/2((d + l)/clogf + /clog f + log f) 
m 

96s 



r y-s ^ ^ 1\ ^ b(2cf/clog^ + log(2^ + l) + 2logf) 
' ' jj, A Ay m 



< 6. 

If we set e = — , then provided that m > J- , 

PrjBfeT^ n,,(f,x) 



A P fc> - Pz > 2b\ 



/2((c/ + l)/clog(8m) + /clog§ + log|; 



< 6. 



D Infinite-dimensional setting 

D.l Rademacher and Gaussian averages and related results 

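Let $\sigma_1, \ldots, \sigma_m$ be independent Rademacher random variables distributed uniformly on $\{-1, 1\}$, and let
$\gamma_1, \ldots, \gamma_m$ be independent Gaussian random variables distributed as $\mathcal{N}(0, 1)$.

Given a sample of m points x, define the conditional Rademacher and Gaussian averages of a function
class $\mathcal{F}$ as

$$\mathcal{R}_{m|x}(\mathcal{F}) = \frac{2}{m} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(x_i) \quad \text{and} \quad \mathcal{G}_{m|x}(\mathcal{F}) = \frac{2}{m} \mathbb{E}_\gamma \sup_{f \in \mathcal{F}} \sum_{i=1}^m \gamma_i f(x_i),$$

respectively.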
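For intuition, the conditional Rademacher average can be computed exactly for a tiny finite class by enumerating all $2^m$ sign vectors. The sketch below does this under the $\frac{2}{m}$ normalization used in the definition above; the function name `rademacher_average` and the toy classes are illustrative assumptions, and the enumeration is only feasible for very small $m$.

```python
from itertools import product


def rademacher_average(function_values):
    """Exact (2/m) * E_sigma sup_f sum_i sigma_i f(x_i) for a finite class.

    function_values: one row per function f, holding (f(x_1), ..., f(x_m)).
    """
    m = len(function_values[0])
    total = 0.0
    # Enumerate every sign vector sigma in {-1, 1}^m.
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * v for s, v in zip(sigma, row))
                     for row in function_values)
    return (2.0 / m) * total / (2 ** m)


single = rademacher_average([[1.0, 1.0]])
symmetric_pair = rademacher_average([[1.0, 1.0], [-1.0, -1.0]])
print(single, symmetric_pair)
```

A class with a single function has average zero, while adding its negation makes the supremum track $|\sum_i \sigma_i|$, so the average becomes strictly positive.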

From Meir and Zhang (2003, Theorem 7), the loss-composed conditional Rademacher average of a function
class $\mathcal{F}$ is bounded by the scaled conditional Rademacher average:

Lemma 7 (Rademacher Loss Comparison Lemma). For every function class $\mathcal{F}$, sample of m points
x, and $\phi$ which is L-Lipschitz continuous in its second argument:

$$\mathcal{R}_{m|x}(\phi \circ \mathcal{F}) \leq L \, \mathcal{R}_{m|x}(\mathcal{F}).$$






Additionally, from Ledoux and Talagrand (1991, a brief argument following Lemma 4.5), the conditional
Rademacher average of a function class $\mathcal{F}$ is bounded up to a constant by the conditional Gaussian average
of $\mathcal{F}$:

Lemma 8 (Rademacher-Gaussian Average Comparison Lemma). For every function class $\mathcal{F}$ and
sample of m points x:

$$\mathcal{R}_{m|x}(\mathcal{F}) \leq \sqrt{\frac{\pi}{2}} \, \mathcal{G}_{m|x}(\mathcal{F}).$$



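The next relation is due to Slepian (1962).

Lemma 9 (Slepian's Lemma). Let $\Omega$ and $\Gamma$ be mean zero, separable Gaussian processes indexed by a
set $T$ such that $\mathbb{E}(\Omega_{t_1} - \Omega_{t_2})^2 \leq \mathbb{E}(\Gamma_{t_1} - \Gamma_{t_2})^2$ for all $t_1, t_2 \in T$. Then $\mathbb{E} \sup_{t \in T} \Omega_t \leq \mathbb{E} \sup_{t \in T} \Gamma_t$.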

Slepian's Lemma essentially says that if the variance of one Gaussian process is bounded by the variance of 
another, then, in expectation, the sup of the first is bounded by the sup of the second. 
We also will make use of McDiarmid's inequality (McDiarmid, 1989). 

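Theorem 6 (McDiarmid's Inequality). Let $X_1, \ldots, X_m$ be random variables drawn iid according to a
probability measure $\mu$ over a space $\mathcal{X}$. Suppose that a function $f : \mathcal{X}^m \to \mathbb{R}$ satisfies

$$\sup_{x_1, \ldots, x_m, x'_i} |f(x_1, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_m)| \leq c_i$$

for any $i \in [m]$. Then

$$\Pr_{X_1, \ldots, X_m} \{ f(X_1, \ldots, X_m) - \mathbb{E} f(X_1, \ldots, X_m) \geq t \} \leq \exp\left( -2t^2 \Big/ \sum_{i=1}^m c_i^2 \right).$$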
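To connect McDiarmid's inequality with the tails of the form $\exp(-mt^2/(2b^2))$ that appear in the proofs, the sketch below evaluates the bound for bounded-difference constants $c_i = 2b/m$. That choice of $c_i$ is an assumption made here for illustration (it corresponds to an average of $m$ terms, each of which can change by at most $2b$ when one coordinate is altered); the function name `mcdiarmid_tail` is likewise hypothetical.

```python
import math


def mcdiarmid_tail(t, c):
    """Upper bound exp(-2 t^2 / sum_i c_i^2) on Pr{f - E f >= t}."""
    return math.exp(-2.0 * t * t / sum(ci * ci for ci in c))


m, b, t = 500, 1.0, 0.1
c = [2.0 * b / m] * m  # assumed bounded-difference constants
bound = mcdiarmid_tail(t, c)
closed_form = math.exp(-m * t * t / (2.0 * b * b))
print(bound, closed_form)
```

With these constants $\sum_i c_i^2 = 4b^2/m$, so the general bound collapses to the closed form printed alongside it.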




D.2 Proofs 



Proof (of Lemma 4): By definition, Prx'x" ) '^s.-vif,^") A Tr^^s,r{f ,^') \ is equal to 





From the permutation argument, if no point in x" violates As{f, •), then the probability that over log | 
points of x' will violate As{f, •) is at most |; in addition, if no point of x" violates C^{f, •), then the probability 
that over log | points of x' will violate Cr(f , •) is at most |. Thus, the probability that more than Tj = 2 log | 
points of x' violate either As{f, •) or Ct(f , •) is at most 6. ■ 



Proof (of Lemma 5): From the definitions of Tj ^^ and 







Now, a routine application of symmetrization by random signs yields



^ t 



' ZZ'.O" 



1 

Przz' sup - ^W)) - fix,))) 

1 " /-I 

sup - ^ (T,(ct,(y;, f (x;)) - 4)(y,-, ^(x,))) > . 

rWn^^.,r^(x') ~t 2 J 

sup - f2 ^(^')) ^ A r ( - E '^'■^W- ^(^/)) ^ a) 

< Pr,,^ \ sup - V cr/4)(y/, f (x,)) > ^ i + Prz.a < sup — V ff/4)(y/, f (x,)) > ^ i . ■ 

Proof (of Lemma 6): By definition, $\varphi_{US}(x) = \arg\min_{z \in \mathbb{R}^k} \|x - USz\|_2^2 + \lambda\|z\|_1$. First, note the equivalences
$\arg\min_{z \in \mathbb{R}^k} \|x - USz\|_2 = \arg\min_{z \in \mathbb{R}^k} \|UU^T x - USz\|_2 = \arg\min_{z \in \mathbb{R}^k} \|U^T x - Sz\|_2$, where the first equality
follows because any $v \in \mathrm{Ker}(U^T)$ (i.e., in the orthogonal complement of the range space of $U$) will be orthogonal to $USz$,
for any $z$; hence, it is sufficient to approximate the projection of $x$ onto the range space of $U$.

Thus, $\varphi_{US}(x) = \arg\min_{z \in \mathbb{R}^k} \|U^T x - Sz\|_2^2 + \lambda\|z\|_1$. Our approach is to use the well-known reformulation
as a quadratic program with linear constraints:

mmimizc Qu{z) z z-z x +A(0^l2;,)z 

subject to z+ > 0/( z^ > 0/t z — z+ + z^ = 0;<, 
where z :— z := {z^z^^ z^^)^ . 




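For optimal solutions $\bar z_* := (z_*, z_*^+, z_*^-)$ and $\bar t_* := (t_*, t_*^+, t_*^-)$ of $Q_U$ and $Q_{U'}$ respectively, from Daniel (1973),
we have

$$(u - \bar z_*)^T \nabla Q_U(\bar z_*) \geq 0 \qquad (10)$$

$$(u - \bar t_*)^T \nabla Q_{U'}(\bar t_*) \geq 0 \qquad (11)$$

for all $u \in \mathbb{R}^{3k}$. Setting $u$ to $\bar t_*$ in (10) and to $\bar z_*$ in (11) and adding (10) and (11) yields

$$(\bar t_* - \bar z_*)^T (\nabla Q_U(\bar z_*) - \nabla Q_{U'}(\bar t_*)) \geq 0,$$

which is equivalent to

$$(\bar t_* - \bar z_*)^T (\nabla Q_{U'}(\bar t_*) - \nabla Q_{U'}(\bar z_*)) \leq (\bar t_* - \bar z_*)^T (\nabla Q_U(\bar z_*) - \nabla Q_{U'}(\bar z_*)). \qquad (12)$$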
S^S 0k>,2k \ f 2S^U^ \ . ^ f Ok 



Here, V(Pl/(z) = „ n ^ ) ^ ^ i n M' + '^ i • After plugging in the expansions 

V 02kxk 02kx2k ) \ '^Ikxd J \ i-2k J 

of $\nabla Q_U$ and $\nabla Q_{U'}$ and incurring cancellations from the zeros, (12) becomes

$$(t_* - z_*)^T (S^T S t_* - 2 S^T U'^T x - S^T S z_* + 2 S^T U'^T x) \leq (t_* - z_*)^T (S^T S z_* - 2 S^T U^T x - S^T S z_* + 2 S^T U'^T x),$$

that is,

$$(t_* - z_*)^T S^T S (t_* - z_*) \leq 2 (t_* - z_*)^T S^T (U'^T - U^T) x.$$

If $t_*$ and $z_*$ both have sparsity level at most $s$, then wherever we typically would consider the operator
norm $\|S\|_2 := \sup_{\|t\| = 1} \|St\|_2$, we instead need only consider the $2s$-restricted operator norm $\|S\|_{2,2s}$.
Note that $(t_* - z_*)^T S^T S (t_* - z_*) \geq \mu_{2s}(S) \|t_* - z_*\|_2^2$, which implies that

$$\mu_{2s}(S) \|t_* - z_*\|_2^2 \leq 2 \|t_* - z_*\|_2 \, \|S\|_{2,2s} \, \|(U'^T - U^T) x\|_2 \quad \Longrightarrow \quad \|t_* - z_*\|_2 \leq \frac{2 \|S\|_{2,2s}}{\mu_{2s}(S)} \|(U'^T - U^T) x\|_2.$$



Proof (of Theorem 3): Define a Gaussian process $\Omega$, indexed by $U$, by $\Omega_U := \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle$. Our goal
is to apply Slepian's Lemma to bound the expectation of the supremum of $\Omega$, which depends on $\varphi_{US}$, by
the expectation of the supremum of a Gaussian process $\Gamma$ which depends only on $U$.



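$$\begin{aligned}
\mathbb{E}_\gamma (\Omega_U - \Omega_{U'})^2 &= \mathbb{E}_\gamma \left( \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle - \sum_{i=1}^m \gamma_i \langle w, \varphi_{U'S}(x_i) \rangle \right)^2 \\
&= \sum_{i=1}^m \langle w, \varphi_{US}(x_i) - \varphi_{U'S}(x_i) \rangle^2 \\
&\leq r^2 \sum_{i=1}^m \|\varphi_{US}(x_i) - \varphi_{U'S}(x_i)\|_2^2 .
\end{aligned}$$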
i=l 

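Applying the result from Lemma 6, we have

$$\begin{aligned}
\mathbb{E}_\gamma (\Omega_U - \Omega_{U'})^2 &\leq \left( \frac{2r\|S\|_{2,2s}}{\mu_{2s}(S)} \right)^2 \sum_{i=1}^m \|(U'^T - U^T) x_i\|_2^2 \\
&= \left( \frac{2r\|S\|_{2,2s}}{\mu_{2s}(S)} \right)^2 \sum_{i=1}^m \sum_{j=1}^k \left( \langle U e_j, x_i \rangle - \langle U' e_j, x_i \rangle \right)^2 \\
&= \left( \frac{2r\|S\|_{2,2s}}{\mu_{2s}(S)} \right)^2 \mathbb{E}_\gamma \left( \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U e_j, x_i \rangle - \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U' e_j, x_i \rangle \right)^2 \\
&= \mathbb{E}_\gamma (\Gamma_U - \Gamma_{U'})^2 \qquad (13)
\end{aligned}$$

for

$$\Gamma_U := \frac{2r\|S\|_{2,2s}}{\mu_{2s}(S)} \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U e_j, x_i \rangle.$$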
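By Slepian's Lemma, $\mathbb{E}_\gamma \sup_U \Omega_U \leq \mathbb{E}_\gamma \sup_U \Gamma_U$. It remains to bound $\mathbb{E}_\gamma \sup_U \Gamma_U$:

$$\begin{aligned}
\frac{\mu_{2s}(S)}{2r\|S\|_{2,2s}} \mathbb{E}_\gamma \sup_U \Gamma_U &= \mathbb{E}_\gamma \sup_U \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U e_j, x_i \rangle \\
&= \mathbb{E}_\gamma \sup_U \sum_{j=1}^k \Big\langle U e_j, \sum_{i=1}^m \gamma_{ij} x_i \Big\rangle \\
&\leq \mathbb{E}_\gamma \sup_U \sum_{j=1}^k \|U e_j\|_2 \Big\| \sum_{i=1}^m \gamma_{ij} x_i \Big\|_2 \\
&= k \, \mathbb{E}_\gamma \Big\| \sum_{i=1}^m \gamma_{i1} x_i \Big\|_2 \\
&\leq k \sqrt{ \mathbb{E}_\gamma \Big\| \sum_{i=1}^m \gamma_{i1} x_i \Big\|_2^2 } \\
&\leq k \sqrt{m},
\end{aligned}$$

where the last step uses $\|x_i\|_2 \leq 1$. Hence,

$$\mathbb{E}_\gamma \sup_U \frac{2}{m} \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle \leq \frac{4r\|S\|_{2,2s} \, k}{\mu_{2s}(S) \sqrt{m}} \leq \frac{4rk\sqrt{2s}}{\mu_{2s}(S) \sqrt{m}},$$

where we used the fact that $\|S\|_{2,2s} \leq \sqrt{2s}$ (see Lemma 10 in Appendix D.2 for a proof).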
Proof (of Theorem 4): In the course of the proof, we will use the following factorization of the space of
dictionaries $\mathcal{D}$ that was introduced by Maurer and Pontil (2008). Each dictionary $D \in \mathcal{D}$ will be factorized
as $D = US$, where $S \in \mathcal{S} := (B_{\mathbb{R}^k})^k$ and

$$U \in \mathcal{U} := \{ U' \in \mathbb{R}^{d \times k} : \|U'x\|_2 = \|x\|_2 \text{ for } x \in \mathbb{R}^k \}.$$

$\mathcal{U}$ is the set of semi-orthogonal matrices in $\mathbb{R}^{d \times k}$; these matrices are left isometries (i.e. they satisfy $U^T U = I$).
Let $\mathcal{S}_\varepsilon$ be a minimum-cardinality proper $\varepsilon$-cover (in operator norm) of $\{S \in \mathcal{S} : \mu_s(S) \geq \mu_s^*, \mu_{2s}(S) \geq \mu_{2s}^*\}$,
the set of suitably incoherent elements of $\mathcal{S}$.

Now, onward to the proof. The goal is to bound 



Prx.cT < sup - V cT;4)(y,-, f (x,)) > ^ \ 



We say an index / is good if and only if T5 ^(f , x,). An index is bad if and only if it is not good. Consider 
a fixed sample z and the occurrence of a set of $m - \eta$ good indices; each of the remaining indices can be either
good or bad. There are $N := \binom{m}{\eta}$ ways to choose this set of indices. Partition $\mathcal{F}_{\mu^*, \eta}(x)$ into $N$ subclasses

via ^^^',r^ (x) — Ujie[w] ■^ii" r i^) where, for all functions in a given subclass, a specific set of m — ri indices is 

guaranteed to be good. More formally, we can choose distinct good index sets Ti T/y, each of size m — r\, 

such that for each J: 

rjC fl {/gH: n,,(^x,)}. 

Clearly, 

m m 

sup cj;4)(y/, f(x,)) = max sup cT/4)(y;, f (x,)). 

For each j e [A/], define the e-neighborhood of J-^^, y (x) defined as 

fl^ r^S^) := \j^{US', w') : \\5 - S'\\ < e,\\w-w'\\ < e.Se 5, 1/1/ G W.{US,w) e -T^ji-.r^W} ■ 
Also, let We be a minimum-cardinality £-cover of W and define a particular epsilon-cover of J-: 

:= {f = {US', w')eT : U eU,S' eSe.w' eWc}. 
Take the intersection J^^, -y (x) n J^£, a disjoint union of subclasses equal to 

U ^i^'Xi-) 

for 

W := •^i^r, W n {f e T : f ^ {US'. w'):UeU}. 
Now, for each j e [A/] and arbitrary cr G { — 1, 1}'", we compare 

^ m m 

sup — cr/4)(y/, f (x/)) and max sup (j/4)(y/, f (x,)). 






Without loss of generality, choose J — 1 and take Ti = [m — r|]. If f G J^^* r.^C'^)' follows that there 
exists f e {Js'es^.w'ew, ^l^-'r^' i''^) such that 

1 

— \'^^y'' ^d(x,)>) - 4)(y/, (w', (Pd'(x,)))| 

/=i 

L /'""'^ \ 1 

- m ( ^ ^ ^"^'^ ] + — ^' ^d(x,))) - 4)(y/, (w', (Pd'(x;)))| 

\ i=l ) /=m-T) + l 

Vh* A XJ m 

where the last line is due to the Sparse Coding Stability Theorem (Theorem 1). 
Therefore, for any cr e { — 1, 1}"' it holds that 

1 

sup — Vff/4)(y/,f(x,)) 
< max max sup — V ff,ct)(y/, f (x,)) + /.£ ( — + 1) + | ) + — . 



Hence, for z fixed 



Pra \ sup — V ff/4)(y;, f (x,)) > I \ 



< Pr<j i max max sup — V (j/4)(y/, f (x,)) > - /.e f — + 1) + - — 

,^|8(,/2)V(«...y'^'" _ 



je[A/] 



PrJ sup lf^cT,-4)(y/,f(x,))> '-/-e f 4(^ + 1) 



Now, from McDiarmid's inequality, for any fixed $j \in [N]$, $S' \in \mathcal{S}_\varepsilon$, and $w' \in \mathcal{W}_\varepsilon$:

m ^ m ^ 

sup - VcT/4)(y/,f(x,)) > E,, sup -Y^cf!4>{yi.f{xi)) + ti} <exp{-mt^/{2b^)). 

Now, we bound 

1 

E(j sup — ^ (j/4)(y/, ^(x,)). 
Without loss of generality, again take $j = 1$ and $T_1 = [m - \eta]$. Then
E 



1 

sup — o-/4)(y/, f(x,)) 

f 1 1 " 1 

cT < sup — V (J/4)(y/, f (x,)) + — V ff,ct)(y/, f (x,)) ^ 

(x) /=i /=m-Ti+i J 

cTi cT„_^ < sup — ^ o-/4)(y/. ^(x;)) > + Ea„_^+i a„ < sup — ^ o-/4)(y/. > 

f 1 1 

cTi (T„_^ < sup — ^ (T/4)(y/, > 

Ue^;;^;:M'" /=i J 



fan 



Now, Theorem 3, the Rademacher Loss Comparison Lemma (Lemma 7), and the Rademacher-Gaussian 
Average Comparison Lemma (Lemma 8) imply that 

1 "s-^ ./ , , \x\ -v/'" ^ 2LJnrkJ~s 

Ea sup — > CJ;4)(y/, 1/1/, (Pus X/) ) < — — 

/\s((y's,x) '-■ ^ 

M.2s(S)Vm ' 

and hence 

Pr. I sup 1 V cT,-4)(f(x,)) > ^^f^y + ^ + ti 1 < exp(-mt2/(2fa2)). 

Combining this bound with the fact that the bound is independent of the draw of z and applying 
Proposition 4 (with d = k) to extend the bound over all choices of j, S' , and w' results in 



for 
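Lemma 10. If $S \in (B_{\mathbb{R}^k})^k$, then $\|S\|_{2,2s} \leq \sqrt{2s}$.

Proof: Define $S_\Lambda$ as the submatrix of $S$ that selects the columns indexed by $\Lambda$. Similarly, for $t \in \mathbb{R}^k$ define
the coordinate projection $t_\Lambda$ of $t$. A supremum over all $2s$-sparse unit vectors is a maximum, over supports $\Lambda$, of the supremum over unit vectors supported on $\Lambda$, so

$$\begin{aligned}
\sup_{\{t : \|t\| = 1, |\mathrm{supp}(t)| \leq 2s\}} \|St\|_2
&= \max_{\Lambda} \sup_{t} \|S_\Lambda t_\Lambda\|_2 \\
&= \max_{\Lambda} \sup_{t} \Big\| \sum_{\omega \in \Lambda} t_\omega S_\omega \Big\|_2 \\
&\leq \max_{\Lambda} \sup_{t} \sum_{\omega \in \Lambda} |t_\omega| \, \|S_\omega\|_2 \\
&\leq \max_{\Lambda} \sup_{t} \sum_{\omega \in \Lambda} |t_\omega| \\
&= \max_{\Lambda} \sup_{t} \|t_\Lambda\|_1 \\
&\leq \max_{\Lambda} \sup_{t} \sqrt{2s} \, \|t_\Lambda\|_2 \\
&= \sqrt{2s},
\end{aligned}$$

where, in every line, the maximum is over $\{\Lambda \subseteq [k] : |\Lambda| \leq 2s\}$ and the supremum is over $\{t : \|t\| = 1, \mathrm{supp}(t) \subseteq \Lambda\}$.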
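The inequality in Lemma 10 is easy to probe numerically. The sketch below samples random $2s$-sparse unit vectors $t$ against a matrix with unit-norm columns and records the largest $\|St\|_2$ seen, then evaluates a configuration (all selected columns identical) in which the bound $\sqrt{2s}$ is attained. The matrix sizes, the random seed, and the helper names are assumptions made for illustration only; random sampling gives a lower estimate of the restricted norm, not its exact value.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a vector uniformly at random from the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sparse_image_norm(columns, support, coeffs):
    """||S t||_2 for t supported on `support` with the given coefficients."""
    dim = len(columns[0])
    out = [0.0] * dim
    for j, c in zip(support, coeffs):
        for i in range(dim):
            out[i] += c * columns[j][i]
    return math.sqrt(sum(v * v for v in out))


rng = random.Random(0)
k, s = 8, 2
columns = [random_unit_vector(k, rng) for _ in range(k)]  # unit-norm columns
worst = 0.0
for _ in range(2000):
    support = rng.sample(range(k), 2 * s)
    coeffs = random_unit_vector(2 * s, rng)  # ||t||_2 = 1
    worst = max(worst, sparse_image_norm(columns, support, coeffs))

# Extremal case: identical columns and equal coefficients 1/sqrt(2s).
e1 = [1.0] + [0.0] * (k - 1)
tight = sparse_image_norm([e1] * (2 * s), list(range(2 * s)),
                          [1.0 / math.sqrt(2 * s)] * (2 * s))
print(worst, tight, math.sqrt(2 * s))
```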



E Covering numbers 

Cucker and Smale (2002, Chapter I, Proposition 5) state that, for a Banach space $E$ of dimension $d$, the
$\varepsilon$-covering numbers of the radius $r$ ball of $E$ are bounded as $\mathcal{N}(rB_E, \varepsilon) \leq (4r/\varepsilon)^d$.

For spaces of dictionaries obeying some deterministic property, such as $\mathcal{D}_\mu = \{D \in \mathcal{D} : \mu_s(D) \geq \mu\}$, one
must be careful to use a proper $\varepsilon$-cover so that the representative elements of the cover also obey the desired
property: A proper cover is more restricted than a cover in that a proper cover must be a subset of the set
being covered (rather than simply being a subset of the ambient Banach space). That is, if $A$ is a proper
cover of a subset $T$ of a Banach space $E$, then $A \subseteq T$. For a cover, we need only $A \subseteq E$. The following bound
relates proper covering numbers to covering numbers (a simple proof can be found in (Vidyasagar, 2002,
Lemma 2.1)): If $E$ is a Banach space and $T \subseteq E$ is a bounded subset, then $\mathcal{N}_{\text{proper}}(E, \varepsilon, T) \leq \mathcal{N}(E, \varepsilon/2, T)$.
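The arithmetic behind such covering bounds can be sketched in a few lines, working with logarithms to avoid overflow. The sketch assumes the ball bound $(4r/\varepsilon)^d$ quoted above and the standard fact that a proper $\varepsilon$-cover need be no larger than an ordinary $(\varepsilon/2)$-cover. The example product (unit-ball dictionaries treated as a $dk$-dimensional unit ball, paired with a radius-$r$ ball of dimension $k$) and all function names are assumptions for illustration, not statements taken from the paper.

```python
import math


def log_ball_cover(dim, r, eps):
    """log of the bound (4r/eps)^dim on the eps-covering number of a radius-r ball."""
    return dim * math.log(4.0 * r / eps)


def log_proper_cover(dim, r, eps):
    """log of a proper-cover bound, via an ordinary cover at scale eps/2."""
    return log_ball_cover(dim, r, eps / 2.0)


def log_product_cover(d, k, r, eps):
    """log of (proper cover of a d*k-dim unit ball) * (cover of a k-dim radius-r ball)."""
    return log_proper_cover(d * k, 1.0, eps) + log_ball_cover(k, r, eps)


d, k, r, eps = 20, 50, 3.0, 0.01
log_product = log_product_cover(d, k, r, eps)
# The same quantity written as a single power with exponent (d + 1) * k.
single_power = (d + 1) * k * math.log(8.0 * (r / 2.0) ** (1.0 / (d + 1)) / eps)
print(log_product, single_power)
```

Collapsing the product into a single power is purely algebraic, which is why the two printed values agree.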

Let $d, k \in \mathbb{N}$. Define $E_\mu := \{E \in (B_{\mathbb{R}^d})^k : \mu_s(E) \geq \mu\}$ and $\mathcal{W} := rB_{\mathbb{R}^k}$. The following bounds derive
directly from the above.

Proposition 3. The proper $\varepsilon$-covering number of $E_\mu$ is bounded by $(8/\varepsilon)^{dk}$.

Proposition 4. The product of the proper $\varepsilon$-covering number of $E_\mu$ and the $\varepsilon$-covering number of $\mathcal{W}$ is
bounded by $\left( 8 (r/2)^{1/(d+1)} / \varepsilon \right)^{(d+1)k}$.